Labs was bristling with discussion and creation this week, with major improvements to two projects, interesting conversations around a few others, and an awesome new blog post.
Data Pipes: lots of improvements
Data Pipes is a Labs project that provides a web API for a set of simple data-transforming operations that can be chained together in the style of Unix pipes.
This past week, Andy Lulham has made a huge number of improvements to Data Pipes. Just a few of the new features and fixes:
- new operations: strip (removes empty rows) and tail (truncates the dataset to its last rows)
- new features: a range function and a "complement" switch for cut; options for grep
- all operations in pipeline are now trimmed for whitespace
- basic tests have been added
Have a look at the closed issues to see more of what Andy has been up to.
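To make the Unix-pipes idea concrete, here is a minimal Python sketch of how chained row operations could behave. It is not the Data Pipes code or its web API: the functions below are hypothetical local stand-ins named after the operations mentioned above, and details such as the default number of rows for tail are assumptions.

```python
# Hypothetical stand-ins for a few Data Pipes operations.
# A dataset is a list of rows; each row is a list of strings.

def strip(rows):
    """Remove rows whose cells are all empty."""
    return [row for row in rows if any(cell.strip() for cell in row)]


def tail(rows, n=10):
    """Keep only the last n rows (the default of 10 is an assumption)."""
    return rows[-n:] if n > 0 else []


def cut(rows, columns, complement=False):
    """Keep the listed column indices, or drop them when complement is True."""
    selected = set(columns)
    return [
        [cell for i, cell in enumerate(row) if (i in selected) != complement]
        for row in rows
    ]


def pipeline(rows, *steps):
    """Apply each (function, kwargs) step in order, like a Unix pipe."""
    for func, kwargs in steps:
        rows = func(rows, **kwargs)
    return rows


data = [
    ["id", "name", "city"],
    ["1", "Ada", "London"],
    ["", "", ""],
    ["2", "Linus", "Helsinki"],
]

result = pipeline(
    data,
    (strip, {}),                                  # drop the empty row
    (cut, {"columns": [0], "complement": True}),  # drop the id column
    (tail, {"n": 2}),                             # keep the last two rows
)
print(result)  # [['Ada', 'London'], ['Linus', 'Helsinki']]
```

Each step takes rows and returns rows, which is what lets the operations be composed in any order.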
Webshot: new homepage and feature
Last week we introduced you to Webshot, a web API for screenshots of web pages.
Back then, Webshot's home page was just a screenshot of GitHub. Now Webshot has a proper home page with a form interface to the API.
Webshot has also added support for full page screenshots. Now you can capture the whole page rather than just its visible portion.
On the blog: natural language processing with Python
Labs member Tarek Amr has contributed an awesome post on Python natural language processing with the NLTK toolkit to the Labs blog.
"The beauty of NLP," Tarek says, "is that it enables computers to extract knowledge from unstructured data inside textual documents." Read his post to learn how to do text normalization, frequency analysis, and text classification with Python.
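As a rough taste of the first two of those steps, here is a minimal sketch of text normalization and frequency analysis. It is not taken from Tarek's post and does not use NLTK: it relies only on the Python standard library, and the stopword list and sample sentence are made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real project would use a fuller one.
STOPWORDS = {"the", "a", "of", "and", "to", "is", "in"}


def normalize(text):
    """Lowercase the text, split it into alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


def frequencies(text):
    """Count how often each normalized token occurs."""
    return Counter(normalize(text))


sample = "The beauty of data is in the data: open data, and the tools to use it."
freq = frequencies(sample)
print(freq.most_common(1))  # [('data', 3)]
```

Text classification builds on the same kind of token counts, but is beyond the scope of this sketch.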
Data Packages workflow à la Node
Wouldn't it be nice to be able to initialize new Data Packages as easily as you can initialize a Node module with npm init?
Max Ogden started a discussion thread around this enticing idea, eventually leading to Rufus Pollock booting a new repo for dpm, the Data Package Manager. Check out dpm's Issues to see what needs to happen next with this project.
Nomenklatura: looking forward
Nomenklatura is a Labs project that does data reconciliation, making it possible "to maintain a canonical list of entities such as persons, companies or even streets and to match messy input, such as their names, against that canonical list".
Friedrich Lindenberg has noted on the Labs mailing list that Nomenklatura has some serious problems, and he has proposed "a fairly radical re-framing of the service".
The conversation around what this re-framing should look like is still underway; check out the discussion thread and jump in with your ideas.
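For readers new to reconciliation, the basic idea of matching messy names against a canonical list can be sketched in a few lines of Python. This is not how Nomenklatura works internally: the sketch uses the standard library's difflib for fuzzy matching, and the canonical list, the reconcile function, and the 0.6 cutoff are all assumptions chosen for illustration.

```python
import difflib

# Hypothetical canonical list of entities.
CANONICAL = [
    "Open Knowledge Foundation",
    "Mozilla Foundation",
    "Wikimedia Foundation",
]


def reconcile(name, canonical=CANONICAL, cutoff=0.6):
    """Return the closest canonical entry for a messy name, or None."""
    # Compare case-insensitively, but return the canonical spelling.
    lookup = {entry.lower(): entry for entry in canonical}
    matches = difflib.get_close_matches(
        name.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


print(reconcile("open knowlege foundaton"))  # Open Knowledge Foundation
print(reconcile("Acme Corp"))                # None
```

Returning None for names with no close match leaves room for a human to decide whether the input is a new entity or a spelling the list should learn.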
Data Issues: following issues
Last week, the idea of Data Issues was floated: using GitHub Issues to track problems with public datasets. The idea has generated a few comments, and we'd love to hear more.
Discussion on the Labs list highlighted another benefit of using GitHub. Alioune Dia suggested that Data Issues should let users register to be notified when a particular issue is fixed. But Chris Mear pointed out that GitHub already makes this possible: "Any GitHub user can 'follow' a specific issue by using the notification button at the bottom of the issue page."
Get involved
Anyone can join the Labs community and get involved! Read more about how you can join the community and participate by coding, wrangling data, or doing outreach and engagement. Also check out the Ideas Page to see what's cooking in the Labs.