Commit b984e520 authored by Dorian

writing data in sqlite db

parent 5038a0f6
......@@ -6,80 +6,20 @@ Information on the API can be found here:
* https://data.gov.be/fr/dataset/01593a26-ed57-498e-bec0-13011a75a773
* https://api.brussels/store/apis/info?name=fixmystreet&version=1.0.0&provider=admin
## data structures
If you want to run the script you'll need:
* python3
* sqlite3 (```sudo apt install sqlite3```)
There are two subcategories of potelet: *damaged* and *missing*.
They contain *4515* and *1833* items respectively.
For comparison, the whole website contains *129781* items.
<!-- We can only make a request by category if we send a subcategory with no subcategory.
So we have to make a different request for every subcategory of potelet. -->
<!-- We can get all the potelets in each of those categories with one call. -->
For every potelet there is an **attachments** thread/section that contains the *pictures* and *comments*.
It reads like a thread in which different actors interact by posting images or text; every attachment has a publication date.
<!-- We have to do a request for every potelet to get the attachments and iterate through it. -->
**Note:** extracting all the potelets and writing them to the db with the python script takes approximately 4h.
For every potelet there is a **history** thread/section that contains its changes of status.
The status changes tell us how much time passed before a responsible party took charge of the incident. The history also contains, as *updates*, the comments that were not part of the creation of the incident.
```python3 db/extrac.py```
For every entry in the attachment or history section, there is an **actor**. Those actors are either:
* *CITIZEN* (users with their private data hidden),
* *PROFESSIONAL* (a municipality or organisation that agrees to 'fix' the incident),
* *SYSTEM* (Région Bruxelles-Capitale).
This makes a structure of 4 tables plus an image folder.
![alt text](datastructure.png "Title")
A special case is the **duplicate** [to be completed]
## stats on images
```
Total number of Potelet: 6349
Total number of Images: 6727
Max number of images by potelet: 23
Average number of images by potelet: 1.0595369349503858
```
* Most potelets, around 78%, have 0 or 1 image.
* Some, around 22%, have between 2 and 6 images.
* Only about 0.3% have 7 images or more.
```
0: 1867
1: 3079
2: 930
3: 295
4: 96
5: 41
6: 20
7: 9
8: 5
10: 2
11: 1
12: 2
13: 1
23: 1
```
## classification and ordering possibilities
### classification
* separate by **categories**: *damaged* and *missing*.
* separate by **responsible/process** (if there is one): *responsible organisation* and/or *responsible department*, and by status: *CREATED*, *PROCESSING*, *CLOSED*
* separate by **location**: *municipality*.
* separate by **type of location**: analysis of the picture and/or location to detect whether it's an intersection, a small street, etc.
### ordering
* order by **creation date**
* order by **last updated date**
* order like **a walk** that **minimises** the change of location and/or time.
## other notes
For every incident there is already an automatically generated .pdf that anyone can download from the website.
We can get it with the API: ```https://fixmystreet.brussels/api/incidents/{id}/pdf?lang={lang}&addressLang={addressLang}&type={type}```
One important piece of data that doesn't appear in the individual information about potelet incidents is **the macro distribution** of the incidents on a map, with its *clusters* and *holes*.
This could be made explicit in the final object if there is one section per municipality, with some nearly empty and others large.
To look at db:
* ```sqlite3```
* ```.open db/potelets.db```
* ```.header on```
* ```.mode column```
* ```select * from potelets;```
* ```select * from attachments where potelet_id=265057;```
* ```select * from history where potelet_id=265057;```
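The same inspection can be done from Python. The following is a minimal sketch, assuming the three tables created by the script (`potelets`, `attachments`, `history`) and assuming the `date` columns hold text that sorts chronologically. The helper names `make_demo_db` and `fetch_incident`, and the demo rows, are made up for illustration and are not part of the script.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory db with the same table names (hypothetical rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE potelets (id integer PRIMARY KEY, status text);
        CREATE TABLE attachments (id integer PRIMARY KEY, potelet_id integer,
                                  date text, type text, content text);
        CREATE TABLE history (id integer PRIMARY KEY, potelet_id integer,
                              date text, type text);
    """)
    conn.execute("INSERT INTO potelets VALUES (265057, 'CREATED')")
    conn.executemany(
        "INSERT INTO attachments VALUES (?, ?, ?, ?, ?)",
        [(2, 265057, "2019-03-02T10:00:00", "COMMENT", "still broken"),
         (1, 265057, "2019-03-01T09:00:00", "PICTURE", "http://example.org/1.jpg")],
    )
    conn.execute(
        "INSERT INTO history VALUES (1, 265057, '2019-03-01T09:00:00', 'CREATED')"
    )
    conn.commit()
    return conn


def fetch_incident(conn, potelet_id):
    """Return (potelet row, attachments, history) for one potelet, oldest first."""
    conn.row_factory = sqlite3.Row  # lets us read columns by name
    cur = conn.cursor()
    potelet = cur.execute(
        "SELECT * FROM potelets WHERE id = ?", (potelet_id,)
    ).fetchone()
    attachments = cur.execute(
        "SELECT * FROM attachments WHERE potelet_id = ? ORDER BY date",
        (potelet_id,),
    ).fetchall()
    history = cur.execute(
        "SELECT * FROM history WHERE potelet_id = ? ORDER BY date",
        (potelet_id,),
    ).fetchall()
    return potelet, attachments, history
```

Against the real file, `sqlite3.connect('db/potelets.db')` would replace `make_demo_db()`.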
import os
import time
import json
import requests
import math
import sqlite3
from sqlite3 import Error
# TODO:
# detect duplicates
db_file = 'potelets.db'
url = 'http://fixmystreet.brussels/api/'
ucat = 'categories'
......@@ -15,7 +19,161 @@ mobilierurbain_catid = 1007
potelet_catid = 2030
# ratio for the number of items got by request
itemsbypages = 12
itemsbypages = 24
poteletbycategory = 48
#---- DB WRITING ----
# TODO:
# actors and actors id in a separate table
# date as date sql field?
# order of iteration/construction for the table?
def empty_file(file):
    """ delete file if exists and create a new empty one
    """
    try:
        os.remove(file)
    except OSError:
        pass
    # create a new empty one
    with open(file, 'w'):
        pass
def create_connection(db_file):
    """ create a database connection to the SQLite database file db_file
        (returns None if the connection fails)
    """
    conn = None
    try:
        conn = sqlite3.connect(db_file)
    except Error as e:
        print(e)
    return conn
def create_table(conn, table_sql):
    """ create a table from the table_sql statement
    """
    try:
        c = conn.cursor()
        c.execute(table_sql)
    except Error as e:
        print(e)
def init_poteletsDB(conn):
    """ create the tables for the potelets db
    """
    potelets_table = """ CREATE TABLE IF NOT EXISTS potelets (
                            id integer PRIMARY KEY,
                            status text,
                            subcat text,
                            adress text,
                            coordinates text,
                            creationDate text,
                            updatedDate text
                        ); """
    attachments_table = """ CREATE TABLE IF NOT EXISTS attachments (
                            id integer PRIMARY KEY,
                            potelet_id integer,
                            date text,
                            type text,
                            content text
                        ); """
    history_table = """ CREATE TABLE IF NOT EXISTS history (
                            id integer PRIMARY KEY,
                            potelet_id integer,
                            date text,
                            type text
                        ); """
    create_table(conn, potelets_table)
    create_table(conn, attachments_table)
    create_table(conn, history_table)
def addPotelet(conn, potelet):
    # basic data
    id = potelet['id']
    status = potelet['status']
    subcat = potelet['category']['category']['nameEn']
    coordinates = json.dumps(potelet['location']['coordinates'])
    adress = (potelet['location']['address']['streetNameFr'] + ' ' +
              potelet['location']['address']['streetNumber'] + ', ' +
              potelet['location']['address']['postalCode'])
    creationDate = potelet['creationDate']
    updatedDate = potelet['updatedDate']
    print('Potelet id: ' + str(id))
    print('Status: ' + status)
    print('Category: ' + subcat)
    print('Adress: ' + adress)
    print('creation date: ' + creationDate)
    print('updated date: ' + updatedDate)
    # other data (not always present)
    # duplicates = potelet['duplicates']
    # resolved = potelet['declaredResolved']
    # occurence = potelet['severalOccurrence']
    # third_party = potelet['thirdParty']
    # ext_id = potelet['externalId']
    # priv_location = potelet['privateLocation']
    # ADD ACTORS RESP ORG / DEP
    # add it to potelets table
    value_list = [id, status, subcat, adress, coordinates, creationDate, updatedDate]
    sql = ''' INSERT INTO potelets(id,status,subcat,adress, coordinates, creationDate,updatedDate)
              VALUES(?,?,?,?,?,?,?) '''
    cur = conn.cursor()
    cur.execute(sql, value_list)
    conn.commit()
def addAttachment(conn, attachment):
    id = attachment['id']
    potelet_id = attachment['incidentId']
    date = attachment['creationDate']
    type = attachment['type']
    content = None
    if type == 'PICTURE':
        content = attachment['_links']['content']['href']
    elif type == 'COMMENT' or type == 'SYSTEM_COMMENT':
        content = attachment['content']
    else:
        print("ERROR: attachment of unknown type " + type)
    # str() because content stays None for an unknown type
    print('• ' + date + ' | add actor: ' + str(content))
    # add it to attachments table
    value_list = [id, potelet_id, date, type, content]
    sql = ''' INSERT INTO attachments(id,potelet_id,date,type,content)
              VALUES(?,?,?,?,?) '''
    cur = conn.cursor()
    cur.execute(sql, value_list)
    conn.commit()
def addStory(conn, story):
    id = story['id']
    potelet_id = story['historizedEntityId']['id']
    date = story['historyDate']
    type = story['historyType']
    print('• ' + date + ' | add actor: ' + ': ' + type)
    # add it to history table
    value_list = [id, potelet_id, date, type]
    sql = ''' INSERT INTO history(id,potelet_id,date,type)
              VALUES(?,?,?,?) '''
    cur = conn.cursor()
    cur.execute(sql, value_list)
    conn.commit()
#---- API EXTRACTING ----
# TODO:
# detect duplicates
def getPoteletCat():
# get the potelet category json object with their subcategory
......@@ -64,15 +222,15 @@ def getPotelets(number_limit=0):
print('')
return potelets
def getAttachments(id):
# get the attachments list (COMMENTS and PICTURES) of a potelet with its id
def getAttachments(potelet):
# get the attachments list (COMMENTS and PICTURES) of a potelet
url_attachments = potelet['_links']['attachments']['href']
attachments = requests.get(url_attachments, headers = headers).json()
#sometimes it's just an empty dict, we transform it into a list...
attachments = attachments['response'] if attachments else []
return attachments
def getHistory(id):
def getHistory(potelet):
# get the history list (changes of status) of a potelet with its id
url_history = potelet['_links']['history']['href']
history = requests.get(url_history, headers = headers).json()
......@@ -80,16 +238,18 @@ def getHistory(id):
return history
def getActorFromList(potelet):
''' get the responsible organisation and departement
always assigned to an incident '''
""" get the responsible organisation and departement
always assigned to an incident
"""
actor_corp = potelet['responsibleOrganisation']['nameEn']
actor_team = potelet['responsibleDepartment']['nameEn']
actor = actor_corp + ' // ' + actor_team
return actor
def getActorFromAttachment(attachment):
''' get the organisation or citizen who made the attachment
it's always only an orga (PROFFESSIONNAL or SYSTEM), but no info of the department '''
""" get the organisation or citizen who made the attachment
it's always only an orga (PROFFESSIONNAL or SYSTEM), but no info of the department
"""
actor = ''
actor_type = attachment['reporter']['type']
if actor_type != 'CITIZEN':
......@@ -104,10 +264,11 @@ def getActorFromAttachment(attachment):
return actor
def getActorFromHistory(story):
''' get the organisation and departement who made the story
we have never an id from there :-(
if it's SYSTEM then there is no department precised
those are never CITIZEN'''
""" get the organisation and department who made the story
we never get an id from there :-(
if it's SYSTEM then no department is specified
those are never CITIZEN
"""
actor = ''
actor_type = story['information']['actorType']
if 'corporation' in story['information']:
......@@ -124,72 +285,45 @@ def getActorFromHistory(story):
if __name__ == '__main__':
start_time = time.time()
print('~!~ POTELETS ~!~')
print('Total number of incidents: ' + str(getNumberOfIncidents()[0]))
print('')
potelet_cat = getPoteletCat()
# print(json.dumps(potelet_cat, indent=2))
# --- CREATE DB
potelets = getPotelets(12)
# print(json.dumps(potelets, indent=2))
empty_file(db_file)
print('New DB file created')
conn = create_connection(db_file)
print('Connection established to DB file')
init_poteletsDB(conn)
print('Tables initialised')
print('')
actors = {}
# img_list = []
# --- EXTRACT AND FILL DB
potelet_cat = getPoteletCat()
potelets = getPotelets(poteletbycategory)
for potelet in potelets:
# img_list += [0]
#--- header
id = potelet['id']
status = potelet['status']
subcat = potelet['category']['category']['nameEn']
adress = (potelet['location']['address']['streetNameFr'] + ' ' +
potelet['location']['address']['streetNumber'] + ', ' +
potelet['location']['address']['postalCode'])
creationdate = potelet['creationDate']
updateddate = potelet['updatedDate']
actor= getActorFromList(potelet)
print('Potelet id: ' + str(id))
print('Status: ' + status)
print('Category: ' + subcat)
print('Adress: ' + adress)
print('responsible: ' + actor)
print('creation date: ' + creationdate)
print('updated date: ' + updateddate)
# --> those are in the history!
attachments = getAttachments(id)
if attachments:
print('---[ attachments ]---')
for attachment in attachments:
actor = getActorFromAttachment(attachment)
date = attachment['creationDate']
if attachment['type']=='PICTURE':
img = attachment['_links']['content']['href']
# img_list[-1] += 1
print('• ' + date + ' | ' + actor + ': ' + str(img))
elif attachment['type']=='COMMENT':
comment = attachment['content']
print('• ' + date + ' | ' + actor + ': ' + comment)
#--- history
history = getHistory(id)
if history:
print('---[ history ]---')
for story in history:
actor = getActorFromHistory(story)
date = story['historyDate']
type = story['historyType']
print('• ' + date + ' | ' + actor + ': ' + type)
print('')
# #-------------
# print(json.dumps(actors, indent=2))
# print(img_list)
# print( 'Total number of Potelets: ' + str(len(img_list)))
# print( 'Total number of Images: ' + str(sum(img_list)))
# print( 'Max number of images by potelet: ' + str(max(img_list)))
# print( 'Average number of images by potelet: ' + str(sum(img_list) / len(img_list)))
addPotelet(conn, potelet)
attachments = getAttachments(potelet)
if attachments:
print('---[ attachments ]---')
for attachment in attachments:
addAttachment(conn, attachment)
history = getHistory(potelet)
if history:
print('---[ history ]---')
for story in history:
addStory(conn, story)
print('')
conn.close()
print('process finished in: ' + str(time.time() - start_time) + ' seconds')
File added
# Potelet / Paaltje / Bollard - Documentation
## data structures
There are two subcategories of potelet: *damaged* and *missing*.
They contain *4515* and *1833* items respectively.
For comparison, the whole website contains *129781* items.
<!-- We can only make a request by category if we send a subcategory with no subcategory.
So we have to make a different request for every subcategory of potelet. -->
<!-- We can get all the potelets in each of those categories with one call. -->
### attachments
For every potelet there is an **attachments** thread/section that contains the *pictures* and *comments*.
It reads like a thread in which different actors interact by posting images or text; every attachment has a publication date.
<!-- We have to do a request for every potelet to get the attachments and iterate through it. -->
### history
For every potelet there is a **history** thread/section that contains its changes of status.
The status changes tell us how much time passed before a responsible party took charge of the incident. The history also contains, as *updates*, the comments that were not part of the creation of the incident.
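As an illustration of that delay, here is a minimal sketch that measures the time between the first history entry and the first entry of a given type. It assumes each entry is a `(date, type)` pair, as stored in the `history` table, and that dates are ISO 8601 strings accepted by `datetime.fromisoformat`. The function name `time_to_first` and the type name `ACCEPTED` in the usage note are placeholders; the real `historyType` values should be checked against the API.

```python
from datetime import datetime


def time_to_first(history, wanted_type):
    """Delay between the earliest entry and the first entry of wanted_type.

    history: iterable of (date, type) pairs, in any order.
    Returns a timedelta, or None if no entry has the wanted type.
    """
    # Parse the dates (assumed ISO 8601) and sort the entries chronologically.
    entries = sorted((datetime.fromisoformat(date), kind) for date, kind in history)
    if not entries:
        return None
    start = entries[0][0]
    for when, kind in entries:
        if kind == wanted_type:
            return when - start
    return None
```

For example, `time_to_first(rows, 'ACCEPTED')` on the rows of one potelet would give the time before the incident was taken in charge, if such an entry exists.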
### actors
For every entry in the attachment or history section, there is an **actor**. Those actors are of three **types**:
* *CITIZEN* (users with their private data hidden),
* *PROFESSIONAL* (a municipality or organisation that agrees to 'fix' the incident),
* *SYSTEM* (Région Bruxelles-Capitale).
For *PROFESSIONAL* and *SYSTEM*, there are two categories of actor: *organisation* and *department* (**with different ids**).
A department is always linked to an organisation and gives more precision about who the actor is.
An actor (*organisation* and *department*) is automatically assigned at the creation of an incident, even before it is marked *ACCEPTED/PROCESSING*.
<!-- There is no visible pointer showing how the *department* and *organisation* are linked, but by iterating through the incidents we can build the connections. -->
<!-- Attachments only reference the *organisation*.
History entries reference a *department* linked to an *organisation*, but its id is not given. -->
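The idea of rebuilding the organisation/department links by iterating through the incidents can be sketched as follows. This is only an illustration: it assumes each incident dict carries `responsibleOrganisation` and `responsibleDepartment` objects with a `nameEn` field, as read by `getActorFromList` in the script, and the function name `link_departments` and the sample names are hypothetical.

```python
from collections import defaultdict


def link_departments(potelets):
    """Map each organisation name to the set of department names seen with it."""
    links = defaultdict(set)
    for potelet in potelets:
        organisation = potelet['responsibleOrganisation']['nameEn']
        department = potelet['responsibleDepartment']['nameEn']
        links[organisation].add(department)
    return dict(links)
```

Keying on names keeps the sketch simple; keying on the ids would be more robust if the incident payload exposes them.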
### schemes
This makes a structure of 4 tables plus an image folder.
![alt text](datastructure.png "Title")
A special case is the **duplicate** [to be completed]
## stats on images
```
Total number of Potelet: 6349
Total number of Images: 6727
Max number of images by potelet: 23
Average number of images by potelet: 1.0595369349503858
```
* Most potelets, around 78%, have 0 or 1 image.
* Some, around 22%, have between 2 and 6 images.
* Only about 0.3% have 7 images or more.
```
0: 1867
1: 3079
2: 930
3: 295
4: 96
5: 41
6: 20
7: 9
8: 5
10: 2
11: 1
12: 2
13: 1
23: 1
```
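The totals and shares can be recomputed from the distribution listed above. A small sketch, with the histogram copied from that listing (number of images, then number of potelets); the helper name `share` is made up for illustration:

```python
# Histogram copied from the listing: number of images -> number of potelets.
histogram = {0: 1867, 1: 3079, 2: 930, 3: 295, 4: 96, 5: 41, 6: 20,
             7: 9, 8: 5, 10: 2, 11: 1, 12: 2, 13: 1, 23: 1}


def share(histogram, low, high):
    """Percentage of potelets whose image count lies in [low, high]."""
    total = sum(histogram.values())
    in_range = sum(n for images, n in histogram.items() if low <= images <= high)
    return 100 * in_range / total


total_potelets = sum(histogram.values())
total_images = sum(images * n for images, n in histogram.items())

print(total_potelets, total_images)                   # 6349 6727
print(round(share(histogram, 0, 1), 1))               # 77.9
print(round(share(histogram, 2, 6), 1))               # 21.8
print(round(share(histogram, 7, max(histogram)), 2))  # 0.33
```

The last share is expressed in percent, so potelets with 7 images or more make up roughly a third of one percent of the set.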
## classification and ordering possibilities
### classification
* separate by **categories**: *damaged* and *missing*.
* separate by **responsible/process** (if there is one): *responsible organisation* and/or *responsible department*, and by status: *CREATED*, *PROCESSING*, *CLOSED*
* separate by **location**: *municipality*.
* separate by **type of location**: analysis of the picture and/or location to detect whether it's an intersection, a small street, etc.
### ordering
* order by **creation date**
* order by **last updated date**
* order like **a walk** that **minimises** the change of location and/or time.
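
One simple way to approximate the walk ordering is a greedy nearest-neighbour pass: start somewhere, then always move to the closest potelet not yet visited. The sketch below is a heuristic, not a true minimum, and it assumes each potelet position is available as an `(x, y)` pair in a planar coordinate system; the actual shape of the `coordinates` field returned by the API has to be checked first. The function name `walk_order` is hypothetical.

```python
import math


def walk_order(points, start=0):
    """Greedy nearest-neighbour ordering; returns the list of point indices."""
    if not points:
        return []
    remaining = set(range(len(points)))
    current = start
    remaining.remove(current)
    order = [current]
    while remaining:
        # Closest unvisited point; ties are broken by the lower index.
        current = min(
            remaining,
            key=lambda i: (math.dist(points[current], points[i]), i),
        )
        remaining.remove(current)
        order.append(current)
    return order


print(walk_order([(0, 0), (5, 0), (1, 0), (6, 0)]))  # [0, 2, 1, 3]
```

The same loop could order by time, or by a weighted mix of distance and time, by changing the key function.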
## other notes
For every incident there is already an automatically generated .pdf that anyone can download from the website.
We can get it with the API: ```https://fixmystreet.brussels/api/incidents/{id}/pdf?lang={lang}&addressLang={addressLang}&type={type}```
One important piece of data that doesn't appear in the individual information about potelet incidents is **the macro distribution** of the incidents on a map, with its *clusters* and *holes*.
This could be made explicit in the final object if there is one section per municipality, with some nearly empty and others large.