Labs was bristling with discussion and creation this week, with major improvements to two projects, interesting conversations around a few others, and an awesome new blog post.
Data Pipes: lots of improvements
Data Pipes is a Labs project that provides a web API for a set of simple data-transforming operations that can be chained together in the style of Unix pipes.
This past week, Andy Lulham has made a huge number of improvements to Data Pipes. Just a few of the new features and fixes:
- new operations: strip (removes empty rows), tail (truncates the dataset to its last rows)
- new features: a range function and a “complement” switch for cut; options for
- all operations in pipeline are now trimmed for whitespace
- basic tests have been added
Have a look at the closed issues to see more of what Andy has been up to.
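To make the Unix-pipe idea concrete, here is a minimal local sketch of chaining operations like the ones described above. It is not the Data Pipes code or its web API: the function names simply mirror the operation names, and the signatures, the default row count, and the sample data are assumptions made for illustration.

```python
from functools import reduce


def strip(rows):
    """Drop rows whose cells are all empty or whitespace."""
    return [row for row in rows if any(cell.strip() for cell in row)]


def tail(rows, n=10):
    """Keep only the last n rows (n=10 is an assumed default)."""
    return rows[-n:] if n > 0 else []


def cut(rows, columns, complement=False):
    """Keep the given column indices, or drop them when complement is True."""
    selected = set(columns)
    return [
        [cell for i, cell in enumerate(row) if (i in selected) != complement]
        for row in rows
    ]


def pipeline(rows, *steps):
    """Feed the output of each step into the next one, like a Unix pipe."""
    return reduce(lambda data, step: step(data), steps, rows)


# Hypothetical sample data, including one empty row.
data = [
    ["id", "name", "city"],
    ["1", "Ada", "London"],
    ["", "", ""],
    ["2", "Grace", "New York"],
    ["3", "Alan", "Manchester"],
]

result = pipeline(
    data,
    strip,                                                 # remove the empty row
    lambda rows: tail(rows, 2),                            # keep the last two rows
    lambda rows: cut(rows, range(0, 1), complement=True),  # drop column 0
)
print(result)
```

Each step takes a list of rows and returns a new list, which is what lets the steps be composed in any order.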
Webshot: new homepage and feature
Last week we introduced you to Webshot, a web API for screenshots of web pages.
Back then, Webshot’s home page was just a screenshot of GitHub. Now Webshot has a proper home page with a form interface to the API.
Webshot has also added support for full page screenshots. Now you can capture the whole page rather than just its visible portion.
On the blog: natural language processing with Python
“The beauty of NLP,” Tarek says, “is that it enables computers to extract knowledge from unstructured data inside textual documents.” Read his post to learn how to do text normalization, frequency analysis, and text classification with Python.
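As a rough taste of the first two steps, text normalization and frequency analysis, here is a small sketch using only the Python standard library. It is not taken from Tarek's post, which may well use other tools; the function names, the stopword list, and the sample sentence are assumptions for illustration.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(text, stopwords=frozenset()):
    """Count how often each token occurs, ignoring the given stopwords."""
    tokens = [token for token in normalize(text) if token not in stopwords]
    return Counter(tokens)


# Hypothetical sample sentence and stopword list.
sample = "The data is open. Open data helps the community!"
freq = word_frequencies(sample, stopwords={"the", "is"})
print(freq.most_common(2))
```

Normalizing first means that “Open” and “open” are counted as the same word.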
Data Packages workflow à la Node
Wouldn’t it be nice to be able to initialize new Data Packages as easily as you can initialize a Node module with
Max Ogden started a discussion thread around this enticing idea, eventually leading to Rufus Pollock booting a new repo for dpm, the Data Package Manager. Check out dpm’s Issues to see what needs to happen next with this project.
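To give a sense of what initializing a Data Package could involve, here is a hypothetical sketch that writes a minimal datapackage.json descriptor for the CSV files in a directory. It is not dpm and does not reflect its design: the function name and behaviour are assumptions, and only the descriptor file name and the name and resources fields come from the Data Package format.

```python
import json
import tempfile
from pathlib import Path


def init_data_package(directory, name):
    """Write a minimal datapackage.json listing the CSV files in directory."""
    directory = Path(directory)
    descriptor = {
        "name": name,
        "resources": [{"path": p.name} for p in sorted(directory.glob("*.csv"))],
    }
    target = directory / "datapackage.json"
    target.write_text(json.dumps(descriptor, indent=2), encoding="utf-8")
    return descriptor


# Try it out in a throwaway directory containing one small CSV file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "data.csv").write_text("id,name\n1,Ada\n", encoding="utf-8")
    descriptor = init_data_package(tmp, "demo-package")
    written = json.loads(Path(tmp, "datapackage.json").read_text(encoding="utf-8"))

print(written)
```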
Nomenklatura: looking forward
Nomenklatura is a Labs project that does data reconciliation, making it possible “to maintain a canonical list of entities such as persons, companies or even streets and to match messy input, such as their names, against that canonical list”.
The conversation around what this re-framing should look like is still underway—check out the discussion thread and jump in with your ideas.
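The core idea of reconciliation, matching messy names against a canonical list, can be sketched with the fuzzy matcher in Python's standard library. This is not how Nomenklatura works internally; the entity list, the function name, and the 0.8 similarity cutoff are assumptions chosen for illustration.

```python
from difflib import get_close_matches

# Hypothetical canonical list of entities.
CANONICAL = ["Open Knowledge Foundation", "World Bank", "European Commission"]


def reconcile(name, canonical=CANONICAL, cutoff=0.8):
    """Return the closest canonical entity for a messy name, or None."""
    # Compare case-insensitively, but return the canonical spelling.
    lookup = {entry.lower(): entry for entry in canonical}
    matches = get_close_matches(
        name.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


print(reconcile("  world bnak "))  # a misspelled, padded variant
print(reconcile("Acme Corp"))      # a name with no close canonical entry
```

A real reconciliation service would also need a way to review uncertain matches and to add new entities, which simple string similarity does not cover.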
Data Issues: following issues
Last week, the idea of Data Issues was floated: using GitHub Issues to track problems with public datasets. The idea has generated a few comments, and we’d love to hear more.
Discussion on the Labs list highlighted another benefit of using GitHub. Alioune Dia suggested that Data Issues should let users register to be notified when a particular issue is fixed. But Chris Mear pointed out that GitHub already makes this possible: “Any GitHub user can ‘follow’ a specific issue by using the notification button at the bottom of the issue page.”
Anyone can join the Labs community and get involved! Read more about how you can join the community and participate by coding, wrangling data, or doing outreach and engagement. Also check out the Ideas Page to see what’s cooking in the Labs.