cdata 0.1.8 2017-08-17 ✔ PY2

cdata on PyPI  

see data, handy snippets for conversion, and ETL.


AuthorLi Ding
LicenseApache 2.0

cdata
-------------

"see data", see data, handy snippets for conversion, cleaning and integration.

install
-------------
pip install cdata


json data manipulation
-------------

* json (and json stream) file IO, e.g. items2file(...)
* json data access, e.g. json_get(...), any2utf8, json_dict_copy
* json array statistics, e.g. stat(...)

.. code-block:: python

from cdata.core import any2utf8
the_input = {"hello": u"世界"}
the_output = any2utf8(the_input)
logging.info((the_input, the_output))


.. code-block:: python
property_list = [
{ "name":"name", "alternateName": ["name","title"]},
{ "name":"birthDate", "alternateName": ["dob","dateOfBirth"] },
{ "name":"description" }
]
json_object = {"dob":"2010-01-01","title":"John","interests":"data","description":"a person"}
ret = json_dict_copy(json_object, property_list)


table data manipulation
-------------

* json array to/from excel

.. code-block:: python

import json
from cdata.table import excel2json,json2excel
filename = "test.xls"
items = [{"first":"hello", "last":"world" }]
json2excel(items, ["first","last"], filename)
ret = excel2json(filename)
print json.dumps(ret)



JSON data from reading a single sheet excel file

.. code-block:: json

{
"fields": {
"00": [
"name",
"年龄",
"notes"
]
},
"data": {
"00": [
{
"notes": "",
"年龄": 18.0,
"name": "张三"
},
{
"notes": "this is li si",
"年龄": 18.0,
"name": "李四"
}
]
}
}

web stuff
-------------

* url domain extraction

entity manipulation
-------------

* entity.SimpleEntity.ner()

.. code-block:: python

from cdata.entity import SimpleEntity
entity_list = [{"@id":"1","name":u"张三"},{"@id":"2","name":u"李四"}]
ner = SimpleEntity(entity_list)
sentence = "张三给了李四一个苹果"
ret = ner.ner(sentence)
logging.info(json.dumps(ret, ensure_ascii=False, indent=4))
"""
[{
"text": "张三",
"entities": [
{
"@id": "1",
"name": "张三"
}
],
"index": 0
},
{
"text": "李四",
"entities": [
{
"@id": "2",
"name": "李四"
}
],
"index": 4
}]
"""

* region.RegionEntity.guess_all()

.. code-block:: python

from cdata.region import RegionEntity
addresses = ["北京海淀区阜成路52号(定慧寺)", "北京大学肿瘤医院"]

city_data = RegionEntity()
result = city_data.guess_all(addresses)
logging.info(json.dumps(result, ensure_ascii=False))
"""
{"province": "北京市",
"city": "市辖区",
"name": "海淀区",
"district": "海淀区",
"cityid": "110108",
"type": "district"}
"""

wikification
-------------

* 通过wikidata搜索,定位对应实体,查找实体中文名,别名等属性。wikidata_search (item/property) and wikidata_get

.. code-block:: python

query = u"居里夫人"
ret = wikidata_search(query, lang="zh")
logging.info(ret)

nodeid = ret["itemList"][0]["identifier"]
ret = wikidata_get(nodeid)
lable_zh = ret["entities"][nodeid]["labels"]["zh"]["value"]
logging.info(lable_zh)


misc
-------------

* support simple cli function using argparse


notes
-------------
release package using https://github.com/pypa/twine