After my last post, Michael contacted me and explained that I could simplify the import.
The apoc.load.json() procedure can process JSONL files directly. Also, with apoc.create.relationship()
the creation of dynamic relationships becomes a lot simpler.
Identities
OpenCorporates recently posted a blog post detailing that the native_company_number
in the German company data is not unique. I previously used these numbers as
identifiers for the merge command when importing. On second thought, as each company should only occur once
in the JSONL file, we don’t need to merge. So, one less index to create and to maintain for Neo4j.
This, of course, assumes that the import is a one-time job.
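In practice this just means switching the company import from a merge keyed on the non-unique number
to a plain create, roughly like this (property names as in the final script below):
// before: merge keyed on the (non-unique) native company number
merge (company:Company {nativeCompanyNumber: c.all_attributes.native_company_number})
// now: every line in the JSONL file is its own company, so a plain create suffices
create (company:Company) set company.id = c.company_number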
Dealing with identities is always a problem when working with data: As I don’t have any identifying data
for the :Officer nodes, I used their names. These are very likely not unique, leading to different persons
being represented as the same officer. For my purposes, this is not a problem. For anything real, these issues
would need to be addressed.
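To make the consequence concrete: the import merges officers on the name alone, so two different people
who happen to share a name end up as a single node:
// a second officer with the same name (from another company) matches the
// existing node instead of creating a new one
merge (officer:Officer {name: o.name})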
Loading JSONL
As I learned, apoc.load.json() can load JSONL files directly. No need to split and convert the file.
The procedure turns the JSONL file into a stream as it is loaded. In our case, each line is a JSON object
on its own, so we get a stream of maps.
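A quick way to see this is to stream the first few lines and look at the resulting maps (same local
file server as in the final script):
call apoc.load.json('http://localhost:8000/de_companies_ocdata.jsonl') yield value
return value.name, value.company_number limit 5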
In order to limit the transaction size, we wrap the call to apoc.load.json()
into a call to apoc.periodic.iterate(),
with the load.json() call as the first parameter, the cypher query to add/merge the data as the second
parameter, and additional options as the third parameter:
// run the second statement for each item returned by the first statement.
// Returns number of batches and total processed rows
apoc.periodic.iterate('statement returning items', 'statement per item',
{batchSize:1000,iterateList:true,parallel:false,params:{},concurrency:50,retries:0})
YIELD batches, total
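Put together, a stripped-down version of the wrapping (companies only, with an explicit batch size)
looks like this:
call apoc.periodic.iterate(
  "call apoc.load.json('http://localhost:8000/de_companies_ocdata.jsonl') yield value as c",
  "create (company:Company) set company.id = c.company_number, company.name = c.name",
  {batchSize:1000, iterateList:true})
yield batches, total
return batches, total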
Creating dynamic relationships
For the relations between officers and companies, I want to use the position
attribute from the JSON file.
Unfortunately, cypher does not offer a switch or similar statement, so I used a bit of a hack involving
foreach statements with only one item if a condition was met.
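For reference, that hack looked roughly like this, repeated once per possible position value
(the position string here is just an example):
// a foreach over a one-element list acts as a poor man's conditional
foreach (x in case when o.position = 'director' then [1] else [] end |
  create (officer)-[:DIRECTOR]->(company))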
As Michael pointed out, there is an apoc procedure for that too: apoc.create.relationship()
// create relationship with dynamic rel-type
apoc.create.relationship(fromNode,'KNOWS',{key:value,…}, toNode)
All that is needed is to uppercase the position attribute and substitute spaces, so the final command becomes:
call apoc.create.relationship(officer, toUpper(replace(o.position, ' ', '_')),
{dismissed:o.other_attributes.dismissed, startDate:date(o.start_date), endDate:date(o.end_date)},company)
yield rel
The final import script is much nicer and loads all data in about 10 minutes:
call apoc.periodic.iterate("call apoc.load.json('http://localhost:8000/de_companies_ocdata.jsonl') yield value as c",
"create (company:Company) set company.id = c.company_number, company.name = c.name,
company.status = c.current_status, company.jurisdictionCode = c.jurisdiction_code,
company.address = c.registered_address, company.registerArt = c.all_attributes._registerArt,
company.registerNummer = c.all_attributes._registerNummer,
company.registerOffice = c.all_attributes.registered_office,
company.nativeCompanyNumber = c.all_attributes.native_company_number,
company.previousNames=[p IN c.previous_names | p.company_name],
company.federalState = c.all_attributes.federal_state,
company.registrar = c.all_attributes.registrar
with c, company UNWIND c.officers as o
merge (officer:Officer {name:o.name}) on create set officer.firstName = o.other_attributes.firstname,
officer.lastName = o.other_attributes.lastname, officer.registeredOffice = c.all_attributes.registered_office
with officer, o, company
call apoc.create.relationship(officer, toUpper(replace(o.position, ' ', '_')),
{dismissed:o.other_attributes.dismissed, startDate:date(o.start_date), endDate:date(o.end_date)},
company) yield rel
return count(officer)
", {})