Motivation: Roadmap SCQA II - Aug/Sep 2017
Situation: alpha running, no publishers yet, 50 signed-up users.
Complication: we don’t have any paying users yet (or know whether we will) and we don’t have a clear plan of action.
Question: what tasks should we focus on in the next 3-6m to create a sustainable business within the next 6-12m?
- what features [user needs] and tasks should we focus on to create a sustainable business?
- How do we create a sustainable business?
Hypothesis: Focus on three complementary areas: provide a growing set of core data for consumers using our own tooling to automate this; provide an initial offering to datars focused on frictionless inspecting and sharing; and engage in “lean” market testing to validate product offers.
- Consumers - core data: Solidify and expand our core dataset offering in order to attract datars and validate the consumer hypothesis (and eat our own dog food)
- Develop workflows inside DataHub that make this cheap and automated (esp for datasets that change)
- Marketing / research: Put out “tasters” for the main business offerings to see who is interested (e.g. enterprise, request data, FD consulting …)
- Do surveys, write up the vision, and use it to inform how we write to people
- Push - datars: Focus on datars and make it simple and easy to push and pull data automatically, and charge for private datasets plus storage over a certain level
This got reworked - see next item
Roadmap (Sep 2017)
Hypothesis: Focus on three complementary areas:
- Core datasets: Solidify and expand our core datasets using our own tooling to automate this in order to eat our own dog food, attract datars and validate the consumer hypothesis
- Develop workflows inside DataHub that make this cheap and automated (esp for datasets that change)
- Marketing / research: engage in lean marketing to test and validate product offers by putting out “tasters” for main business offerings to see who is interested (e.g. enterprise, request data, FD consulting …)
- Do surveys, write up the vision, and use it to inform how we write to people
- [SECONDARY] Push - datars: make it simple and easy for datars to push and pull data automatically, and charge for private datasets plus storage over a certain level
Detailed Plan
Marketing Roadmap
Less defined. Added retrospectively …
Developer Roadmap
NTS 2017-09-17
- Export formats and showcase – lots of options
- This is an obvious point for external developers to contribute
- => Defining this contribution “endpoint” - creating drivers (see the driver sketch after this list)
- Import + Transform
- Driven in part by the core datasets
- Add features needed to support them one by one
- Do by hand and push
- Then infer …
- Interfaces e.g. CLI, Local App, Web interface
- Management features
- Usage tracking
- Privacy
- Teams
- Getting paid
- Integrating payment
- Supporting users
- Marketing
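As a rough illustration of the contribution “endpoint” for export drivers: a minimal sketch, assuming a hypothetical `Exporter` interface and registry (the names are illustrative, not an existing DataHub API).

```python
# Hypothetical exporter "driver" contract that external developers could implement.
# Names (Exporter, register, EXPORTERS) are illustrative only.
from abc import ABC, abstractmethod
from typing import Dict


class Exporter(ABC):
    """One driver per export format (sqlite, zip, npm, rds, ...)."""

    format_name: str  # e.g. "sqlite"

    @abstractmethod
    def export(self, datapackage_dir: str, output_path: str) -> None:
        """Read a Data Package from datapackage_dir and write the export to output_path."""


# Simple registry so the platform (or CLI) can discover contributed drivers.
EXPORTERS: Dict[str, Exporter] = {}


def register(exporter: Exporter) -> None:
    EXPORTERS[exporter.format_name] = exporter
```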
graph LR
subgraph Automated Clean
coreauto[Core Auto v0.1 - 2 pkgs<br/><br/>Write specs by hand & push]
coreautov0.2[Core Auto v0.2 - 10 pkgs]
end
subgraph Export
export1[Export Sqlite, Zip]
export2[Export Node/npm]
export3[Make your own exporter<br/><br/>involve community in<br/>creating them]
end
subgraph Push UI
clipush[CLI - Push<br/>url/Push spec]
cliv0.6["CLI Tidy (minimal)"]
end
factoryv2(Factory v2 Redesign<br/>DAG + Parallelization)
showcase[Showcase v1.1<br/>download UX, R instructions]
preview[Preview large files]
subgraph Tracking
track[Tracking<br/><br/>Storage, B/W, CPU]
trackv0.1[Track usage - storage]
trackv0.2[Track usage - downloads]
trackv0.3[Track usage - cpu]
end
storechanges(Store Every Change)
eventstore(EventStore)
dashboard[Dashboard v1]
validate[Validation<br/>Pipeline Reports]
factoryv2 --> coreauto
coreauto --> coreautov0.2
factoryv2 --> export1
factoryv2 --> preview
export1 --> export2
export2 --> export3
clipush --> cliv0.6
clipush --> coreauto
preview --> showcase
trackv0.2 --> showcase
trackv0.2 --> track
storechanges --> trackv0.1
trackv0.1 --> showcase
trackv0.1 --> track
trackv0.3 --> track
eventstore --> dashboard
classDef donemidnov fill:lightgreen,stroke:#333,stroke-width:2px;
classDef donedec fill:#7BBF6A,stroke:#333,stroke-width:2px;
classDef priority fill:pink,stroke:#333,stroke-width:2px;
class factoryv2,coreauto,preview,trackv0.1,clipush,export1 donemidnov;
class eventstore,coreautov0.2,cliv0.6,showcase,dashboard,validate,storechanges donedec;
Developer …
- EventStore: an index in MetaStore where we store user-related events, e.g. runs, user sign-ins, dataset creation, downloads (?) - see the sketch below
- Store every change: new model for storing outputs of every flow run in an economical way - see TODO
- Economising on version space …
- Research Electron multi-platform builds
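A minimal sketch of the kind of record the EventStore could hold - field names are assumptions, not a settled schema:

```python
# Sketch of a possible EventStore record; field names are assumptions, not a schema decision.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Event:
    kind: str                      # e.g. "flow-run", "sign-in", "dataset-created", "download"
    user_id: str
    dataset_id: Optional[str] = None
    payload: dict = field(default_factory=dict)
    occurred_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


# Example: record a dataset download so usage tracking / the dashboard can query it later.
evt = Event(kind="download", user_id="anon-123", dataset_id="core/finance-vix")
```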
Language
- Assembler
- Assembly line vs flow …
- Assembly spec (blueprint?)
Features
Priorities
- Experiment with and evaluate “clean/transform” automation
- Automate: cleaning 5-10 core datasets
- Review the core datasets - do they require regular updating, how does update currently work
- => Product definition on the left-hand side: UI for an automated wrangling system (that would allow us to compare this to other potential product visions …)
- Export: npm, zip, sqlite, rds
- Fix R, python integration instructions …
- Performance and tracking: work to speed things up etc, track usage, gather metrics about what is working …
- Marketing: …
- CLI
- progress bar …
- don’t reupload if data is already there (see the sketch after this list) …
- lots more
- Docs
- Comms & marketing
- Pricing - and offer
- Write posts about datasets
- About tools ..
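A minimal sketch of the “don’t reupload” idea, assuming the server can be asked whether it already holds content with a given hash (the lookup function is hypothetical, not the real CLI/API):

```python
# Sketch of a client-side "skip upload if unchanged" check; remote_hash_lookup is hypothetical.
import hashlib
from typing import Callable, Optional


def file_sha256(path: str) -> str:
    """Stream the file so large datasets don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def should_upload(path: str, remote_hash_lookup: Callable[[str], Optional[str]]) -> bool:
    """Upload only if the server does not already hold content with this hash."""
    return remote_hash_lookup(file_sha256(path)) is None
```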
NTS:
Question: how do we connect “pipelines” together (concept of the dataset or project)?
More
- set privacy
- change source …
- explicitly set public and private or …?
- support charging
- preview large datasets
- introducing data validation (goodtables lite)
- data statistics: percentage complete, cf. https://github.com/ResidentMario/missingno (see the sketch after this list)
- Import from URL:
`data add {url} --cron=1d`
- cron job scraping
- modify / add views
- And the preview: preprocess the transform
- Transforms on data
- => transform language …
- edit title
- debug pipeline
- view the pipeline structure and data at each stage
- push to BigQuery
- showcase improvements
- /v/{viewid}.png or /v/{viewid}.html?embed=true
- Why: so you can embed
- view json (?)
- /r/{resource}/{primary-key-id}.[json|html|csv]
- e.g. /data/population/r/annual/2010-us
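For the “percentage complete” statistic mentioned above, a rough pandas sketch (in the spirit of missingno, not that library itself):

```python
# Sketch: per-column completeness for a tabular resource, in the spirit of missingno.
import pandas as pd

df = pd.read_csv("data.csv")              # any tabular resource
completeness = df.notnull().mean() * 100  # % of non-missing values per column
print(completeness.round(1).to_dict())    # e.g. {"year": 100.0, "value": 97.3}
```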
Evaluation of Q4 2017 (mid Dec 2017)
Hypothesis: Focus on three complementary areas:
- Core datasets: Solidify and expand our core datasets using our own tooling to automate this in order to eat our own dog food, attract datars and validate the consumer hypothesis
- Develop workflows inside DataHub that make this cheap and automated (esp for datasets that change)
Outcome: 70%. Core dataset automation so far is only about 10% done; it is the current priority, and by end of year we may have advanced substantially towards what we want.
Available to users but undocumented
- Marketing / research: engage in lean marketing to test and validate product offers by putting out “tasters” for main business offerings to see who is interested (e.g. enterprise, request data, FD consulting …)
- Do surveys, write up vision and use to inform how we write to people
Outcome: 30%.
- Request data done
- Pricing done
- Did not do FD consulting or enterprise.
- Blog posts and rising traffic
- [SECONDARY] Push - datars make it simple and easy for datars to push and pull data automatically and charge for private datasets plus storage over a certain level
Outcome: 80%.
- Implemented pricing
- Implemented automated pull of data (but not super well promoted)
Extras
- We implemented data validation so that we can validate it as a product
- We completely rebuilt our data factory which massively improved performance
Appendix
Discussion 3 Sep 2017 - Rufus + Adam
“Pipelines” can be used for:
- cleaning: “…”
- transform: “normal pipelines”
- export aka “dump”
Discussion:
- Where do we focus in the data flow (for now)?
- Do we focus on clean, on transform, or on export?
- Adam: there isn’t a lot of benefit in pushing a static data file directly to BigQuery - it’s not that interesting … it’s a bit trivial
- Rufus: good point, and partly that’s why we emphasise showcase so much?
- what gives value to users and what can we get money from?
- Rufus: it is a tough sell for people to get automated cleaning - because they have to understand and adopt your system
- Proposal
- Experiment with and evaluate “clean/transform” automation
- Automate: cleaning 5-10 core datasets
- Review the core datasets - do they require regular updating, how does update currently work
- => Product definition on the left-hand side: UI for an automated wrangling system (that would allow us to compare this to other potential product visions …)
- Export: npm, zip, sqlite, rds
- Fix R, python integration instructions …
- Performance and tracking: work to speed things up etc, track usage, gather metrics about what is working …
- Marketing: …
Product ideas
- Export to your own RDS …
- Metaphor of “deployment”
If we want to use pipelines for cleaning:
- DSL to define clean operations (delete the first two rows, apply this regex, etc.)
- ?? - has significant limits - not clear what 80/20 looks like (?) - “this way lies madness” ;-)
- “Dockerize” or DPP-ize the operation: allow people to write either arbitrary code (Docker) or new pipelines that go: raw data -> data package
- => registry for “processors” (be they Docker containers or pipelines)
Golden path
- Simple cleaning-operations DSL, e.g. borrow from the unix shell (see the sketch below)
- Allow for snippets (leading to full blown pipelines)
- (middle way) a defined, limited API
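A minimal sketch of what the simple cleaning-operations DSL could look like: a declarative spec plus a tiny runner. The operation names and spec format are assumptions, not an existing pipeline language.

```python
# Sketch of a declarative cleaning spec plus runner; operation names are assumptions.
import re

CLEAN_SPEC = [
    {"op": "skip_rows", "n": 2},  # e.g. delete the first two junk rows
    {"op": "regex_replace", "field": 1, "pattern": r"[$,]", "repl": ""},  # strip $ and , from column 1
]


def run_clean(rows, spec):
    """rows: list of row lists; returns cleaned rows."""
    for step in spec:
        if step["op"] == "skip_rows":
            rows = rows[step["n"]:]
        elif step["op"] == "regex_replace":
            for row in rows:
                row[step["field"]] = re.sub(step["pattern"], step["repl"], row[step["field"]])
    return rows


# Example: clean a small in-memory table.
raw = [["junk"], ["junk"], ["2010", "$1,000"], ["2011", "$1,200"]]
print(run_clean(raw, CLEAN_SPEC))  # [['2010', '1000'], ['2011', '1200']]
```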