Semiconductor Supply Chain Ontology — Data Source Reference & Access Guide

Dustin J Ross & J Bliss Perry | Stanford Graduate School of Business | April 2026

47 Total Sources | 28 Free / GSB Access | 19 Paid / Commercial | 15 Priority 1 Must-Haves


Purpose of This Document

This report identifies, classifies, and prioritizes all data sources required to populate the Project TBD digital twin ontology. For each source, it documents the access method, priority tier, and practical acquisition steps. It is designed to serve as a primary reference for research, data engineering, and Cowork automation tasks.

How to Read This Report

This report is structured for dual use: human reference and Cowork task input. Each section is self-contained and can be handed to Cowork with a clear prompt.

Access Classification

  • FREE — Publicly accessible at no cost. Download, API, or direct web access.

  • FREE / GSB — Free in general but may have rate limits; GSB library may provide enhanced API access or full datasets.

  • GSB LIBRARY — Available via Stanford GSB library subscriptions (Bloomberg, Refinitiv, Panjiva, FactSet, Capital IQ, etc.).

  • PAID — Commercial license required. Not accessible via GSB. Must be budgeted or negotiated.

  • VARIES — Access depends on specific product tier, API plan, or dataset.

Priority Tiers

  • P1 — MUST HAVE — Core to ontology population. Without these, the digital twin cannot be seeded.

  • P2 — HIGH VALUE — Significantly enriches the graph. Product-to-facility linkages, ownership trees, trade flows. Target in Phase 2.

  • P3 — ENRICHMENT — Dynamic updates, facility verification, event monitoring. Target after core graph is established.
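
When a section is handed to Cowork as task input, the two classification schemes above could be carried as a small typed catalog. The following is a minimal sketch, not a prescribed schema: the class names, field names, and the `select` helper are illustrative assumptions, and the three sample rows take their labels from Sections 2 and 3 of this report.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    FREE = "FREE"
    FREE_GSB = "FREE / GSB"
    GSB_LIBRARY = "GSB LIBRARY"
    PAID = "PAID"
    VARIES = "VARIES"


class Priority(Enum):
    P1 = 1  # must have
    P2 = 2  # high value
    P3 = 3  # enrichment


@dataclass(frozen=True)
class Source:
    name: str
    category: str
    access: Access
    priority: Priority


# Three illustrative rows only; the full catalog would list every source.
CATALOG = [
    Source("SEC EDGAR", "Corporate Disclosures", Access.FREE, Priority.P1),
    Source("Bloomberg Terminal (SPLC)", "Financial Data Infrastructure",
           Access.GSB_LIBRARY, Priority.P1),
    Source("TechInsights", "Market Intelligence & Industry Reports",
           Access.PAID, Priority.P2),
]


def select(catalog, priority, exclude_paid=True):
    """Return source names in one tier, optionally skipping PAID sources."""
    return [
        s.name
        for s in catalog
        if s.priority is priority
        and not (exclude_paid and s.access is Access.PAID)
    ]
```

For example, `select(CATALOG, Priority.P1)` lists the P1 sources that need no budget line, which is the set the acquisition playbook tackles first.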


Section 1 — Data Source Identification

Ten categories encompass all data sources relevant to populating the semiconductor supply chain ontology.

1. Corporate Disclosures

The richest free source of supplier dependency, geographic risk, and corporate structure data.

  • SEC EDGAR — 10-K, 20-F, 8-K, Form SD, Proxy statements

  • TWSE — Taiwan-listed filings (TSMC, MediaTek, Novatek)

  • KRX — Korea Exchange filings (Samsung, SK Hynix)

  • Euronext / Deutsche Börse — European filings (ASML, Infineon, STMicro)

  • Company IR websites — Annual reports, investor day decks, earnings transcripts

  • Sustainability / ESG reports — Tier 1–3 supplier lists, conflict minerals (Form SD)

2. Regulatory & Government Lists

The compliance layer of the ontology. All major restricted party lists are free, machine-readable, and frequently updated.

  • BIS Entity List / Unverified List / Denied Persons List

  • UFLPA Entity List (CBP)

  • OFAC SDN List (bulk XML available)

  • DOD Section 1260H List — Chinese military companies

  • CHIPS.gov project tracker — recipient firms, facility commitments

  • FCC equipment authorization database

  • ITC Section 337 filings (EDIS)

  • Japan METI export control lists

  • EU Dual-Use Regulation lists

  • WTO Trade Policy Reviews

  • Taiwan MOEA / Investment Commission

  • South Korea MOTIE — K-Chips Act recipients

3. Trade & Customs Data

Bilateral trade flow data at the HS-code level. The relevant headings are HS 8541/8542 (semiconductor devices and integrated circuits) and HS 8486 (semiconductor manufacturing equipment).

  • UN Comtrade — HS-code bilateral flows (free tier; enhanced via GSB)

  • USITC DataWeb — U.S. import/export by HS code

  • Census Bureau USA Trade Online

  • Eurostat COMEXT — EU trade flows

  • Panjiva / S&P Global Trade Intelligence — bill of lading, shipper/consignee (GSB)

  • ImportGenius — Asia-U.S. lane coverage (paid)

  • Taiwan Customs Administration

4. Market Intelligence & Industry Reports

Pre-resolved supply chain relationships, market share data, and capacity figures.

  • Gartner — market share by segment, vendor rankings

  • IDC — market sizing, demand by end market

  • IHS Markit / S&P Global Market Intelligence

  • Omdia / Wood Mackenzie — fab capacity databases, capex tracking, node roadmaps

  • VLSI Research — equipment and materials spend by fab

  • Yole Group — packaging, MEMS, power, advanced packaging supply chains

  • TechInsights — teardown data, die-level supplier identification

  • IC Insights / Knometa Research — wafer capacity by fab, company, geography

5. Standards & Industry Bodies

  • SEMI — equipment/materials standards, fab census, industry surveys

  • SIA — annual factbook, market data, policy briefs

  • JEDEC — component standards (product taxonomy reference)

  • GSA (Global Semiconductor Alliance) — fabless/foundry relationship data

  • Responsible Business Alliance (RBA) — supply chain due diligence data

6. Geospatial & Facility-Level Data

  • OpenStreetMap — fab and OSAT facility location

  • Google Maps / Earth — manual facility verification

  • CHIPS.gov — announced U.S. fab investments with location data

  • EPA permit databases — facility confirmation, capacity correlation

  • Planet Labs / Maxar / Satellogic — satellite imagery for construction tracking (paid)

7. Patent & Technology Databases

  • USPTO Patent Full-Text Database

  • EPO Espacenet

  • Google Patents

  • PatSnap / Derwent Innovation — patent analytics, competitive overlap (GSB likely)

8. Financial Data Infrastructure

Corporate hierarchy and subsidiary mapping is essential for entity resolution.

  • Bloomberg Terminal / SPLC function — supply chain relationships (GSB)

  • Refinitiv Eikon / Workspace — corporate hierarchies, supply chain module (GSB)

  • FactSet — supply chain relationships, geographic revenue segments (GSB)

  • Capital IQ — corporate trees, M&A history (GSB)

  • Orbis / Bureau van Dijk — global ownership hierarchies, subsidiary mapping (paid)

9. Academic & Research Literature

  • NBER working papers

  • RAND / CSIS / CNAS reports

  • Stanford SIEPR / GSB working papers

  • arXiv — CS and economics preprints

  • Journal of International Economics, Management Science (GSB library)

10. News & Event Data

  • GDELT Project — structured event extraction from global news

  • DigiTimes — Taiwan supply chain trade publication

  • SemiAnalysis — deep technical supply chain analysis

  • Bloomberg / Reuters / Nikkei Asia

  • Factiva — historical news archive (GSB)


Section 2 — Access Classification

The key insight: the free tier is sufficient to seed the ontology skeleton. Paid sources speed up the build by providing pre-resolved relationships.

Fully Free — No Cost, No Credential Required

SEC EDGAR, TWSE/KRX/Euronext filings, BIS Entity List, OFAC SDN List, UFLPA Entity List, DOD 1260H List, CHIPS.gov, FCC database, ITC EDIS, Japan METI lists, EU Dual-Use lists, UN Comtrade (free tier), USITC DataWeb, Census USA Trade Online, Eurostat COMEXT, OpenStreetMap, Google Maps, EPA permits, USPTO/EPO/Google Patents, NBER/RAND/CSIS/CNAS, SIA Factbook, GDELT, SemiAnalysis.

Free or Enhanced via Stanford GSB Library

Bloomberg Terminal (SPLC), Capital IQ, Panjiva, Refinitiv Eikon, FactSet, Gartner/IDC reports, Factiva, UN Comtrade enhanced API, PatSnap/Derwent, academic journals.

Action Required: Audit current GSB library subscriptions against this list before prototype build begins.

Paid / Commercial (License Required)

Omdia/Wood Mackenzie ($5–15K), TechInsights ($20K+), IC Insights/Knometa ($5–10K), Yole Group, VLSI Research, Orbis/Bureau van Dijk, ImportGenius, Planet Labs/Maxar, DigiTimes ($1–3K/yr).


Section 3 — Priority Order

P1 — Must Have (Seed the Ontology)

  1. SEC EDGAR (10-K, 20-F) — Supplier concentration disclosures, geographic risk factors, subsidiary lists

  2. BIS Entity List + OFAC SDN + UFLPA Entity List — The three regulatory lists defining the compliance screening layer

  3. UN Comtrade (HS 8541/8542/8486) — Country-to-country trade flows for semiconductor products and equipment

  4. USITC DataWeb — U.S.-specific import/export data with more granularity than Comtrade

  5. Bloomberg Terminal — SPLC function — Pre-resolved supplier/customer relationship data

  6. Capital IQ — Corporate Trees — Subsidiary-to-parent mapping essential for entity resolution

  7. CHIPS.gov Project Tracker — Facility-level data on new U.S. semiconductor investments

  8. TWSE / KRX / Euronext Filings — Local exchange filings for TSMC, Samsung, SK Hynix, ASML, Infineon

P2 — High Value (Enrich the Graph)

Panjiva, Refinitiv Eikon, FactSet, Gartner/IDC, Omdia/Wood Mackenzie, Yole Group, TechInsights, USPTO/EPO/Google Patents, ITC Section 337, SEMI fab census, Taiwan MOEA, DOD 1260H, SIA Factbook.

P3 — Enrichment (Dynamic & Verification Layer)

GDELT, SemiAnalysis, Bloomberg/Reuters/Nikkei Asia, DigiTimes, Factiva, OpenStreetMap/Google Earth, EPA permits, Planet Labs/Maxar, RAND/CSIS/CNAS, arXiv/NBER.


Section 5 — Data Acquisition Playbook

Phase A — Free Sources (Do First)

SEC EDGAR

  • Use the sec-edgar-downloader Python library

  • Target: TSMC (20-F), Intel, Qualcomm, Broadcom, Nvidia, Applied Materials, Lam Research, ASML (20-F)

  • Extract Item 1, Item 1A, and Item 7 from 10-Ks (the 20-F counterparts are Items 4, 3.D, and 5) — supplier names, geographic dependencies, facility references

  • Pipe through LLM for SUPPLIES_TO and MANUFACTURES_AT relationship candidates
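
The item-extraction step above can be sketched with the standard library alone. This is a rough illustration that assumes the filing has already been downloaded and converted to plain text; the `extract_items` helper and its heading pattern are hypothetical, and real filings (HTML tables, tables of contents, inconsistent heading styles) will need more robust handling before the text is passed to an LLM.

```python
import re

# Matches line-initial headings such as "Item 1.", "ITEM 1A:" or "Item 7."
HEADING = re.compile(
    r"^[ \t]*item[ \t]+(\d{1,2}[a-c]?)[ \t]*[.:]",
    re.IGNORECASE | re.MULTILINE,
)


def extract_items(text, wanted=("1", "1A", "7")):
    """Return {item label: section text} for the requested 10-K items.

    Each section runs from its heading to the next item heading. When a
    label appears more than once (for example in a table of contents),
    the longest body is kept.
    """
    matches = list(HEADING.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        label = m.group(1).upper()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[m.end():end].strip()
        if label in wanted and len(body) > len(sections.get(label, "")):
            sections[label] = body
    return sections
```

Keeping the longest body per label is a cheap heuristic for skipping table-of-contents entries; it is a design shortcut, not a guarantee.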

Regulatory Lists (BIS / OFAC / UFLPA)

  • BIS: CSV download → Entity nodes with SUBJECT_TO_REGULATION edges

  • OFAC: Bulk XML; the list changes frequently, so schedule a daily cron job to re-download it

  • UFLPA: PDF — use LLM or pdfplumber to extract

  • Merge into unified restricted-party lookup table
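
One way to picture the unified restricted-party table is a dictionary keyed by a normalized entity name, with each key mapping to the set of lists that name the entity. The sketch below assumes the names have already been parsed out of the BIS, OFAC, and UFLPA files; the function names are illustrative and the entities in the usage example are fictitious.

```python
import re
from collections import defaultdict


def normalize(name):
    """Upper-case, replace punctuation with spaces, collapse whitespace."""
    name = re.sub(r"[^A-Z0-9 ]+", " ", name.upper())
    return " ".join(name.split())


def build_lookup(sources):
    """sources: mapping of list label -> iterable of raw entity names."""
    table = defaultdict(set)
    for label, names in sources.items():
        for name in names:
            key = normalize(name)
            if key:
                table[key].add(label)
    return dict(table)


def screen(table, name):
    """Return the sorted list labels that contain this name, if any."""
    return sorted(table.get(normalize(name), set()))
```

With `table = build_lookup({"BIS": ["Example Micro Co., Ltd."], "OFAC": ["EXAMPLE MICRO CO LTD"]})`, a call to `screen(table, "example micro co. ltd")` reports both lists. Exact matching on normalized names will miss transliteration variants and aliases, so this is a first pass ahead of the entity resolution work in Phase C.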

UN Comtrade

  • Register for API key (500 free calls/month)

  • Pull bilateral flows for HS 8541, 8542, 8486

  • Create SHIPS_PRODUCT edges between jurisdiction nodes
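
The edge-building step can be sketched independently of the API call. The record fields used here (`reporter`, `partner`, `hs`, `year`, `value_usd`) are assumed names for illustration and would need to be mapped from whatever the Comtrade response actually returns; the sketch also assumes each record is an export flow from reporter to partner.

```python
from collections import defaultdict

HS_HEADINGS = {"8541", "8542", "8486"}


def build_edges(records):
    """Aggregate trade records into SHIPS_PRODUCT edges.

    Records are grouped by (reporter, partner, 4-digit HS heading, year)
    and their values summed, so 6-digit subheadings roll up to a heading.
    """
    totals = defaultdict(float)
    for r in records:
        heading = str(r["hs"])[:4]
        if heading in HS_HEADINGS and r["reporter"] != r["partner"]:
            key = (r["reporter"], r["partner"], heading, r["year"])
            totals[key] += float(r["value_usd"])
    return [
        {"type": "SHIPS_PRODUCT", "src": src, "dst": dst,
         "hs": heading, "year": year, "value_usd": value}
        for (src, dst, heading, year), value in sorted(totals.items())
    ]
```

Rolling subheadings up to the 4-digit heading keeps the jurisdiction graph small; finer granularity can be kept by changing the grouping key.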

CHIPS.gov

  • Scrape project list → Facility and Company nodes

  • Cross-reference against EDGAR and Bloomberg SPLC

Phase B — GSB Library Sources

Action: Email GSB library to confirm access to Bloomberg SPLC, Capital IQ, Panjiva, Refinitiv Eikon, FactSet, and Gartner/IDC. Confirm bulk export permissions.

Bloomberg Terminal — SPLC: Ticker → SPLC → Export. ~2–3 hours per 20 companies.

Capital IQ — Corporate Trees: Company → Tearsheet → Subsidiaries → Export.

Panjiva: Search by HS code + destination → shipper/consignee pairs.

Phase C — Entity Resolution

  • Canonical identifier: LEI (Legal Entity Identifier), looked up via the GLEIF API; use the OpenCorporates API to cross-check company registry records

  • Secondary: SEC CIK, Bloomberg ticker, ISIN, DUNS

  • Tooling: OpenRefine for manual cleanup; Dedupe.io or the Python recordlinkage library for automated matching

  • Validate against Capital IQ corporate trees
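
To make the alias-matching idea concrete, here is a deliberately small sketch that uses the standard library's `difflib` as a stand-in for Dedupe.io or recordlinkage. The suffix list, the 0.85 threshold, and the `resolve` helper are assumptions for illustration, and the identifiers in the usage example are placeholders, not real LEIs.

```python
import re
from difflib import SequenceMatcher

# Legal-form tokens dropped before comparison (illustrative, not exhaustive).
SUFFIXES = {"INC", "CORP", "CORPORATION", "CO", "COMPANY",
            "LTD", "LIMITED", "AG", "PLC"}


def clean(name):
    """Upper-case, strip punctuation, and drop legal-form tokens."""
    tokens = re.sub(r"[^A-Za-z0-9 ]+", " ", name).upper().split()
    return " ".join(t for t in tokens if t not in SUFFIXES)


def resolve(name, canonical, threshold=0.85):
    """Match a raw name against known aliases.

    canonical: mapping of canonical id -> list of known names.
    Returns (best id, score), or (None, score) when below the threshold.
    """
    target = clean(name)
    best_id, best_score = None, 0.0
    for cid, aliases in canonical.items():
        for alias in aliases:
            score = SequenceMatcher(None, target, clean(alias)).ratio()
            if score > best_score:
                best_id, best_score = cid, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score
```

Given `canonical = {"ID-0001": ["Taiwan Semiconductor Manufacturing Company Limited", "TSMC"]}`, the name "Taiwan Semiconductor Manufacturing Co., Ltd." resolves to `ID-0001`. Matches near the threshold are the ones worth routing to manual review in OpenRefine.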

Phase D — Paid Sources (Defer or Partner Access)

Defer these until after the Series A or until partner access is available. Exception: DigiTimes (~$1–3K/yr) is justifiable early.

Week 1 Plan

  1. Day 1: Download all regulatory lists (BIS, OFAC, UFLPA) → unified restricted-party table

  2. Day 2: EDGAR bulk downloader for top 20 semiconductor companies; extract supplier mentions with LLM

  3. Day 3: UN Comtrade bilateral flows for HS 8541/8542/8486 (2019–2023)

  4. Day 4: GSB Bloomberg SPLC pulls for top 20 tickers; Capital IQ corporate trees

  5. Day 5: Entity resolution — master alias table and LEI mapping

By end of Week 1: ontology skeleton seeded with ~200 company nodes and 500+ relationship edges, with the full compliance layer operational.